Software I : Anaconda, AstroPy, and libraries

We will make extensive use of Python and various associated libraries and so the first thing we need to ensure is that we all have a common setup and are using the same software. The Python distribution that we have decided to use is Anaconda which can be downloaded from here (although we hope that you have already done this prior to the school). Make sure that you installed the Python 2.7 version for your operating system (there is nothing wrong with Python 3.x but it is slightly different syntactically). Python 2 will be supported until 2020. We promote to write code that supports Python 2.7 and 3.x simultaneously, and you can use the future package explained here

For installing Anaconda in Linux follow this link

For installing Anaconda in Mac follow this link

For installing Anaconda in Windows follow this link

We will follow the LSST Data Managment Style guide explained here

Installing packages

One of the advantages of the Anaconda distribution is that it comes with many of the most commonly-used Python packages, such as numpy, scipy, and scikit-learn, preinstalled. However, if you do need to install a new package then it is very straightforward: you can either use the Anaconda installation tool conda or the generic Python tool pip (both use a different registry of available packages and sometimes a particular package will not available via one tool but will be via the other).

For example, irlbpy is a superfast algorithm for finding the largest eigenvalues (and corresponding eigenvectors) of very large matrices. We can try to install it first with conda:

conda install irlbpy

but this will not find it:

Fetching package metadata: .... Error: No packages found in current osx-64 channels matching: irlbpy

You can search for this package on Binstar with
binstar search -t conda irlbpy </code>

so instead we try with pip:

pip install irlbpy

In the event that both fail, you always just download the package source code and then install it manually with:

python install setup.py

in the appropriate source directory.

We'll now take a brief look at a few of the main Python packages.

Python interpreter

The standard way to use the Python programming language is to use the Python interpreter to run python code. The python interpreter is a program that reads and execute the python code in files passed to it as arguments. At the command prompt, the command python is used to invoke the Python interpreter.

For example, to run a file my-program.py that contains python code from the command prompt, use::

$ python my-program.py

We can also start the interpreter by simply typing python at the command line, and interactively type python code into the interpreter.

This is often how we want to work when developing scientific applications, or when doing small calculations. But the standard python interpreter is not very convenient for this kind of work, due to a number of limitations.

IPython

IPython is an interactive shell that addresses the limitation of the standard python interpreter, and it is a work-horse for scientific use of python. It provides an interactive prompt to the python interpreter with a greatly improved user-friendliness.

Some of the many useful features of IPython includes:

Command history, which can be browsed with the up and down arrows on the keyboard. Tab auto-completion. In-line editing of code. Object introspection, and automatic extract of documentation strings from python objects like classes and functions. Good interaction with operating system shell. Support for multiple parallel back-end processes, that can run on computing clusters or cloud services like Amazon EE2.

Jupyter notebook

Jupyter notebook is an HTML-based notebook environment for Python, similar to Mathematica or Maple. It is based on the IPython shell, but provides a cell-based environment with great interactivity, where calculations can be organized and documented in a structured way.

Although using a web browser as graphical interface, IPython notebooks are usually run locally, from the same computer that run the browser. To start a new Jupyter notebook session, run the following command:

$ jupyter notebook

from a directory where you want the notebooks to be stored. This will open a new browser window (or a new tab in an existing window) with an index page where existing notebooks are shown and from which new notebooks can be created. Usually, the URL for the Jupyter notebook is http://localhost:8888

NumPy

NumPy is the main Python package for working with N-dimensional arrays. Any list of numbers can be recast as a NumPy array:


In [58]:
import numpy as np
x = np.array([1, 5, 3, 4, 2])
x


Out[58]:
array([1, 5, 3, 4, 2])

Arrays have a number of useful methods associated with them:


In [59]:
print x.min(), x.max(), x.sum(), x.argmin(), x.argmax()


1 5 15 0 1

and NumPy functions can act on arrays in an elementwise fashion:


In [60]:
np.sin(x * np.pi / 180.)


Out[60]:
array([ 0.01745241,  0.08715574,  0.05233596,  0.06975647,  0.0348995 ])

Ranges of values are easily produced:


In [61]:
np.arange(1, 10, 0.5)


Out[61]:
array([ 1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,  5.5,  6. ,
        6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5])

In [62]:
np.linspace(1, 10, 5)


Out[62]:
array([  1.  ,   3.25,   5.5 ,   7.75,  10.  ])

In [63]:
np.logspace(1, 3, 5)


Out[63]:
array([   10.        ,    31.6227766 ,   100.        ,   316.22776602,
        1000.        ])

Random numbers are also easily generated in the half-open interval [0, 1):


In [64]:
help(np.random.random)


Help on built-in function random_sample:

random_sample(...)
    random_sample(size=None)
    
    Return random floats in the half-open interval [0.0, 1.0).
    
    Results are from the "continuous uniform" distribution over the
    stated interval.  To sample :math:`Unif[a, b), b > a` multiply
    the output of `random_sample` by `(b-a)` and add `a`::
    
      (b - a) * random_sample() + a
    
    Parameters
    ----------
    size : int or tuple of ints, optional
        Output shape.  If the given shape is, e.g., ``(m, n, k)``, then
        ``m * n * k`` samples are drawn.  Default is None, in which case a
        single value is returned.
    
    Returns
    -------
    out : float or ndarray of floats
        Array of random floats of shape `size` (unless ``size=None``, in which
        case a single float is returned).
    
    Examples
    --------
    >>> np.random.random_sample()
    0.47108547995356098
    >>> type(np.random.random_sample())
    <type 'float'>
    >>> np.random.random_sample((5,))
    array([ 0.30220482,  0.86820401,  0.1654503 ,  0.11659149,  0.54323428])
    
    Three-by-two array of random numbers from [-5, 0):
    
    >>> 5 * np.random.random_sample((3, 2)) - 5
    array([[-3.99149989, -0.52338984],
           [-2.99091858, -0.79479508],
           [-1.23204345, -1.75224494]])


In [65]:
np.random.random(10)


Out[65]:
array([ 0.58512266,  0.31375932,  0.76770149,  0.51873712,  0.89672698,
        0.72973333,  0.01459332,  0.51402772,  0.64779699,  0.48416579])

or from one of the large number of statistical distributions provided:


In [66]:
np.random.normal(loc = 2.5, scale = 5, size = 10)


Out[66]:
array([ 3.19074595, -7.39056813,  0.08293087, -0.25255838, -4.95182036,
       -2.63537978, -8.22688495,  5.6114966 ,  5.06056232, -1.36367878])

Another useful method is the where function for identifying elements that satisfy a particular condition:


In [67]:
x = np.random.normal(size = 100)
np.where(x > 3.)


Out[67]:
(array([], dtype=int64),)

Of course, all of these work equally well with multidimensional arrays.


In [68]:
x = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
np.sin(x)


Out[68]:
array([[ 0.84147098,  0.90929743,  0.14112001, -0.7568025 , -0.95892427],
       [-0.2794155 ,  0.6569866 ,  0.98935825,  0.41211849, -0.54402111]])

Data can also be automatically loaded from a file into a Numpy array via the loadtxt or genfromtxt methods:


In [69]:
data = np.loadtxt("data/sample_data.csv", delimiter = ",", skiprows = 3)
data


Out[69]:
array([[  3.00000000e+00,   2.01200000e+03,   4.20000000e+04],
       [  4.00000000e+00,   2.01300000e+03,   4.50000000e+04],
       [  5.00000000e+00,   2.01400000e+03,   4.70000000e+04],
       [  6.00000000e+00,   2.01500000e+03,   5.00000000e+04]])

Matplotlib

Matplotlib is Python's most popular and comprehensive plotting library that is especially useful in combination with NumPy/SciPy.


In [70]:
import matplotlib.pyplot as plt
%matplotlib inline

x=np.array([1,2,3,4,5,6,7,8,9,10])
y=x**2

plt.plot(x,y)
plt.xlabel('X-axis title')
plt.ylabel('Y-axis title')


Out[70]:
<matplotlib.text.Text at 0x10e6d9350>

In [71]:
# evenly sampled time at 200ms intervals
t = np.arange(0., 5., 0.2)

# red dashes, blue squares and green triangles
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')


Out[71]:
[<matplotlib.lines.Line2D at 0x10e5ea990>,
 <matplotlib.lines.Line2D at 0x10e5eab90>,
 <matplotlib.lines.Line2D at 0x10e5f9210>]

In [72]:
help(plt.hist)


Help on function hist in module matplotlib.pyplot:

hist(x, bins=10, range=None, normed=False, weights=None, cumulative=False, bottom=None, histtype=u'bar', align=u'mid', orientation=u'vertical', rwidth=None, log=False, color=None, label=None, stacked=False, hold=None, data=None, **kwargs)
    Plot a histogram.
    
    Compute and draw the histogram of *x*. The return value is a
    tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
    [*patches0*, *patches1*,...]) if the input contains multiple
    data.
    
    Multiple data can be provided via *x* as a list of datasets
    of potentially different length ([*x0*, *x1*, ...]), or as
    a 2-D ndarray in which each column is a dataset.  Note that
    the ndarray form is transposed relative to the list form.
    
    Masked arrays are not supported at present.
    
    Parameters
    ----------
    x : (n,) array or sequence of (n,) arrays
        Input values, this takes either a single array or a sequency of
        arrays which are not required to be of the same length
    
    bins : integer or array_like, optional
        If an integer is given, `bins + 1` bin edges are returned,
        consistently with :func:`numpy.histogram` for numpy version >=
        1.3.
    
        Unequally spaced bins are supported if `bins` is a sequence.
    
        default is 10
    
    range : tuple or None, optional
        The lower and upper range of the bins. Lower and upper outliers
        are ignored. If not provided, `range` is (x.min(), x.max()). Range
        has no effect if `bins` is a sequence.
    
        If `bins` is a sequence or `range` is specified, autoscaling
        is based on the specified bin range instead of the
        range of x.
    
        Default is ``None``
    
    normed : boolean, optional
        If `True`, the first element of the return tuple will
        be the counts normalized to form a probability density, i.e.,
        ``n/(len(x)`dbin)``, i.e., the integral of the histogram will sum
        to 1. If *stacked* is also *True*, the sum of the histograms is
        normalized to 1.
    
        Default is ``False``
    
    weights : (n, ) array_like or None, optional
        An array of weights, of the same shape as `x`.  Each value in `x`
        only contributes its associated weight towards the bin count
        (instead of 1).  If `normed` is True, the weights are normalized,
        so that the integral of the density over the range remains 1.
    
        Default is ``None``
    
    cumulative : boolean, optional
        If `True`, then a histogram is computed where each bin gives the
        counts in that bin plus all bins for smaller values. The last bin
        gives the total number of datapoints.  If `normed` is also `True`
        then the histogram is normalized such that the last bin equals 1.
        If `cumulative` evaluates to less than 0 (e.g., -1), the direction
        of accumulation is reversed.  In this case, if `normed` is also
        `True`, then the histogram is normalized such that the first bin
        equals 1.
    
        Default is ``False``
    
    bottom : array_like, scalar, or None
        Location of the bottom baseline of each bin.  If a scalar,
        the base line for each bin is shifted by the same amount.
        If an array, each bin is shifted independently and the length
        of bottom must match the number of bins.  If None, defaults to 0.
    
        Default is ``None``
    
    histtype : {'bar', 'barstacked', 'step',  'stepfilled'}, optional
        The type of histogram to draw.
    
        - 'bar' is a traditional bar-type histogram.  If multiple data
          are given the bars are aranged side by side.
    
        - 'barstacked' is a bar-type histogram where multiple
          data are stacked on top of each other.
    
        - 'step' generates a lineplot that is by default
          unfilled.
    
        - 'stepfilled' generates a lineplot that is by default
          filled.
    
        Default is 'bar'
    
    align : {'left', 'mid', 'right'}, optional
        Controls how the histogram is plotted.
    
            - 'left': bars are centered on the left bin edges.
    
            - 'mid': bars are centered between the bin edges.
    
            - 'right': bars are centered on the right bin edges.
    
        Default is 'mid'
    
    orientation : {'horizontal', 'vertical'}, optional
        If 'horizontal', `~matplotlib.pyplot.barh` will be used for
        bar-type histograms and the *bottom* kwarg will be the left edges.
    
    rwidth : scalar or None, optional
        The relative width of the bars as a fraction of the bin width.  If
        `None`, automatically compute the width.
    
        Ignored if `histtype` is 'step' or 'stepfilled'.
    
        Default is ``None``
    
    log : boolean, optional
        If `True`, the histogram axis will be set to a log scale. If `log`
        is `True` and `x` is a 1D array, empty bins will be filtered out
        and only the non-empty (`n`, `bins`, `patches`) will be returned.
    
        Default is ``False``
    
    color : color or array_like of colors or None, optional
        Color spec or sequence of color specs, one per dataset.  Default
        (`None`) uses the standard line color sequence.
    
        Default is ``None``
    
    label : string or None, optional
        String, or sequence of strings to match multiple datasets.  Bar
        charts yield multiple patches per dataset, but only the first gets
        the label, so that the legend command will work as expected.
    
        default is ``None``
    
    stacked : boolean, optional
        If `True`, multiple data are stacked on top of each other If
        `False` multiple data are aranged side by side if histtype is
        'bar' or on top of each other if histtype is 'step'
    
        Default is ``False``
    
    Returns
    -------
    n : array or list of arrays
        The values of the histogram bins. See **normed** and **weights**
        for a description of the possible semantics. If input **x** is an
        array, then this is an array of length **nbins**. If input is a
        sequence arrays ``[data1, data2,..]``, then this is a list of
        arrays with the values of the histograms for each of the arrays
        in the same order.
    
    bins : array
        The edges of the bins. Length nbins + 1 (nbins left edges and right
        edge of last bin).  Always a single array even when multiple data
        sets are passed in.
    
    patches : list or list of lists
        Silent list of individual patches used to create the histogram
        or list of such list if multiple input datasets.
    
    Other Parameters
    ----------------
    kwargs : `~matplotlib.patches.Patch` properties
    
    See also
    --------
    hist2d : 2D histograms
    
    Notes
    -----
    Until numpy release 1.5, the underlying numpy histogram function was
    incorrect with `normed`=`True` if bin sizes were unequal.  MPL
    inherited that error.  It is now corrected within MPL when using
    earlier numpy versions.
    
    Examples
    --------
    .. plot:: mpl_examples/statistics/histogram_demo_features.py
    
    Notes
    -----
    
    In addition to the above described arguments, this function can take a
    **data** keyword argument. If such a **data** argument is given, the
    following arguments are replaced by **data[<arg>]**:
    
    * All arguments with the following names: 'x', 'weights'.
    
    
    
    
    Additional kwargs: hold = [True|False] overrides default hold state


In [73]:
x=10+5*np.random.randn(10000)

n, bins, patches=plt.hist(x,bins=30)


AstroPy

AstroPy aims to provide a core set of subpackages to specifically support astronomy. These include methods to work with image and table data formats, e.g., FITS, VOTable, etc., along with astronomical coordinate and unit systems, and cosmological calculations.


In [74]:
from astropy import units as u
from astropy.coordinates import SkyCoord

c = SkyCoord(ra = 10.625 * u.degree, dec = 41.2 * u.degree, frame = 'icrs')
print c.to_string('hmsdms')
print c.galactic


00h42m30s +41d12m00s
<SkyCoord (Galactic): (l, b) in deg
    (121.12334339, -21.6403587)>

In [75]:
from astropy.cosmology import WMAP9 as cosmo
print cosmo.comoving_distance(1.25), cosmo.luminosity_distance(1.25)


3944.5841858 Mpc 8875.31441806 Mpc

You can read FITS images and VOTables using astropy


In [76]:
from astropy.io import fits
hdulist = fits.open('data/sample_image.fits')
hdulist.info()


Filename: data/sample_image.fits
No.    Name         Type      Cards   Dimensions   Format
0    PRIMARY     PrimaryHDU     161   (891, 893)   int16   
1    er.mask     TableHDU        25   1600R x 4C   [F6.2, F6.2, F6.2, F6.2]   

In [77]:
data=hdulist[0].data
data


Out[77]:
array([[ 7201,  6642,  6642, ...,  9498,  9498, 10057],
       [ 6642,  6363,  6642, ..., 10057, 10616, 10616],
       [ 6922,  6642,  6922, ..., 10337, 11175, 10616],
       ..., 
       [ 5412,  5132,  5412, ..., 13000, 12580, 12021],
       [ 5796,  5517,  5796, ..., 12546, 12546, 11987],
       [ 5796,  5796,  6076, ..., 11987, 12546, 12546]], dtype=int16)

In [78]:
from astropy.io.votable import parse
votable = parse('data/sample_votable.xml')
table = votable.get_first_table()

In [79]:
fields=table.fields
data=table.array

print fields
print data


[<FIELD ID="Wavelength" datatype="float" name="Wavelength" ucd="em.wl" unit="AA" utype="spec:Data.SpectralAxis.Value"/>, <FIELD ID="Transmission" datatype="float" name="Transmission" ucd="phys.transmission" utype="spec:Data.FluxAxis.Value"/>]
[(12890.0, 0.0) (13150.0, 0.0) (13410.0, 0.0) (13680.0, 0.0) (13970.0, 0.0)
 (14180.0, 0.0) (14400.0, 0.0005000000237487257)
 (14620.0, 0.00279999990016222) (14780.0, 0.008100000210106373)
 (14860.0, 0.028699999675154686) (14930.0, 0.08709999918937683)
 (15040.0, 0.2013999968767166) (15150.0, 0.4381999969482422)
 (15280.0, 0.6863999962806702) (15390.0, 0.8180999755859375)
 (15460.0, 0.882099986076355) (15510.0, 0.9118000268936157)
 (15560.0, 0.9269000291824341) (15650.0, 0.9293000102043152)
 (15720.0, 0.8726999759674072) (15770.0, 0.8565999865531921)
 (15830.0, 0.8826000094413757) (15920.0, 0.9180999994277954)
 (15970.0, 0.9266999959945679) (16020.0, 0.9075999855995178)
 (16130.0, 0.9259999990463257) (16190.0, 0.9204999804496765)
 (16280.0, 0.9241999983787537) (16330.0, 0.9235000014305115)
 (16420.0, 0.9417999982833862) (16480.0, 0.9491000175476074)
 (16570.0, 0.9807000160217285) (16590.0, 0.9937000274658203) (16710.0, 1.0)
 (16840.0, 0.9560999870300293) (17010.0, 0.9240999817848206)
 (17150.0, 0.9821000099182129) (17270.0, 0.991599977016449)
 (17390.0, 0.9886999726295471) (17460.0, 0.979200005531311)
 (17510.0, 0.9682000279426575) (17530.0, 0.9369999766349792)
 (17560.0, 0.9190000295639038) (17640.0, 0.8422999978065491)
 (17750.0, 0.6671000123023987) (17850.0, 0.2694000005722046)
 (17900.0, 0.45159998536109924) (17960.0, 0.17309999465942383)
 (18030.0, 0.10769999772310257) (18100.0, 0.07069999724626541)
 (18130.0, 0.005100000184029341) (18180.0, 0.019999999552965164)
 (18280.0, 0.00039999998989515007) (18350.0, 0.0)
 (18500.0, 9.999999747378752e-05) (18710.0, 0.0) (18930.0, 0.0)
 (19140.0, 0.0)]

A useful affiliated package is Astroquery which provides tools for querying astronomical web forms and databases. This is not part of the regular AstroPy distribution and needs to be installed separately. Whereas many data archives have standardized VO interfaces to support data access, Astroquery mimics a web browser and provides access via an archive's form interface. This can be useful as not all provided information is necesarily available via the VO.

For example, the NASA Extragalactic Database is a very useful human-curated resource for extragalactic objects. However, a lot of the information that is available via the web pages is not available through an easy programmatic API. Let's say that we want to get the list of object types associated with a particulae source:


In [80]:
from astroquery.ned import Ned
coo = SkyCoord(ra = 56.38, dec = 38.43, unit = (u.deg, u.deg))
result = Ned.query_region(coo, radius = 0.07 * u.deg)
set(result.columns['Type'])


Out[80]:
{'G', 'RadioS', 'UvS'}

SciPy

SciPy provides a number of subpackages that deal with common operations in scientific computing, such as numerical integration, optimization, interpolation, Fourier transforms and linear algebra.


In [81]:
f = lambda x: np.cos(-x ** 2 / 9.)

x = np.linspace(0, 10, 11)
y = f(x)

from scipy.interpolate import interp1d
f1 = interp1d(x, y)
f2 = interp1d(x, y, kind = 'cubic')

from scipy.integrate import quad
print quad(f1, 0, 10)
print quad(f2, 0, 10)
print quad(f, 0, 10)


(1.6346035509274763, 1.1580617576001373e-08)
(1.2743983238992254, 1.1301087878068563e-08)
(1.4332524555959525, 5.534144099065977e-11)

Scikit-learn

scikit-learn provides algorithms for machine learning tasks, such as classification, regression, and clustering, as well as associated operations, such as cross-validation and feature normalization. A related module is astroML which is a wrapper around a lot of the scikit-learn routines but also offers some additional functionality and faster/alternate implementations of some methods.

Pandas

pandas offers data structures, particularly data frames, and operations for manipulating numerical tables and time series, such as fancy indexing, reshaping and pivoting, and merging, as well as a number of analysis tools. Although similar functionality already exists in numpy, pandas is highly optimized for performance and large data sets.

Other libraries

For some of the other lectures or projects this week, you might also need to install the following Python packages:

  • photutils
  • glue